Abstract

This paper describes a statistical model for extraction of
events at the sentence level, or “semantic tagging”, typically
the first level of processing in Information Extraction
systems. We illustrate the approach using a management
succession task, tagging sentences with three slots
involved in each succession event: the post, the person coming
into the post, and the person leaving the post. The approach
requires very limited resources: a part-of-speech
tagger; a morphological analyzer; and a set of training
examples that have been labeled with the three slots and
the indicator (verb or noun) used to express the event.
Training on 560 sentences, and testing on 356 sentences,
shows the accuracy of the approach is 77.5% (if partial
slot matches are deemed incorrect) or 87.8% (if partial
slot matches are deemed correct).

1  Introduction 

Statistical models have been used quite successfully in
natural language processing for recovery of hidden structure
such as part-of-speech tags, or syntactic structure.
This paper considers semantic tagging of text within
the context of information extraction, as in the Sixth
Message Understanding Conference (MUC-6). MUC-6
looked at extraction of events concerning management
successions in newspaper texts: recovering the post,
company, person entering and person leaving the post.
We will concentrate on the initial stage of processing,
extraction of events at the sentence level. For example,
given the sentence

Last week Hensley West , 59 years old , was
named as president , a surprising development.

the  desired  output  from  the  system  would  be 

{IN  =  Hensley  West,  POST  =  president,  IND  =  named  } 

POST  is  a  slot  designating  the  title  of  the  position,  IN 
is  the  person  coming  in  to  fill  the  post,  and  IND  is  an 
“indicator”  -  usually  a  verb  or  a  noun  -  used  to  express 
the  event.  Table  1  gives  some  more  examples. 

The traditional approach to this problem, as exemplified
in SRI’s FASTUS system (Appelt et al. 93), has been

The work reported here was supported in part by the Defense Advanced
Research Projects Agency. Technical agents for part of this
work were Fort Huachuca under contract number DABT63-94-C-0062.
The views and conclusions contained in this document are those
of the authors and should not be interpreted as necessarily representing
the official policies, either expressed or implied, of the Defense
Advanced Research Projects Agency or the United States Government.


(IN Jack Bradley) was (IND named) (POST acting president and chief
executive officer) of this computer-network manufacturer .

(IN He) (IND succeeds) as (POST president) (OUT Edward Marinaro)
, 56 , who is retiring .

(IN Mr. Stanley) ’s (IND appointment) comes as AK Steel ’s Middletown
Works is under investigation by OSHA because of its safety
record , which includes one accident that killed four men in April 1994

The unexpected (IND departure) of (OUT Citicorp ’s highly regarded
head of retail banking) and the appointment of a tobacco executive to
fill his shoes has caught most Citicorp employees off-guard and confounded
many analysts .

On Friday , the bank said (OUT Pei-yuan Chia) , head of retail banking
, will (IND retire) this year .

Table 1: Some example sentences.

to use hand-coded rules. These are typically encoded in a
series of finite-state transducers that progressively build
information in a bottom-up fashion. We are interested in
developing a machine-learning approach to this problem
for two reasons. First, developing hand-coded rules is a
lengthy task which requires a fairly considerable amount
of expertise - and a new set of rules must be developed
for each new domain. Annotating training text examples
such as those in table 1 can conceivably be done by a
non-expert. Second, writing accurate rules is difficult, as
there are many complex interactions between the rules,
and there are many details to be covered. This task becomes
even more complex when the interaction between
the sentence-level rules and the later stages of processing
(co-reference and merging, see section 1.1) is considered.
Machine learning techniques have been shown
to be highly effective at managing this kind of complexity
in applications such as speech recognition, part-of-speech
tagging and parsing.

Not surprisingly this problem can be approached using
finite-state tagging methods, which have previously
been applied to part of speech tagging (Church 88) and
named-entity identification (Bikel et al. 97). We initially
consider this approach, but argue that the Markov approximation
gives an extremely bad parameterization of
the problem. Instead, the method uses a Probabilistic
Context Free Grammar (PCFG), which has the advantages
of being flexible enough to allow a good parameterization
of the problem, while having an efficient decoding
algorithm, a variant of the CKY dynamic programming
algorithm for parsing with context-free grammars.
The PCFG does not encode linguistic phrase structure;
rather, it is semantically motivated, modeling choices
such as the choice of indicator, number of slots, fillers
for the slots, and generation of other “noise” words in
the sentence.

While this paper largely concentrates on the management
succession domain, and motivates many of the
choices regarding representation with examples from it,
the principles should be general enough to also work for
other domains; in fact, the method was originally developed
for an IE task involving company acquisitions
(identifying the buyer, seller and item being bought), and
then moved to the management succession domain with
no domain-specific tuning.

1.1  The  Complete  Information  Extraction  Task 

While a semantic tagging model may be useful for many
tasks, our primary motivation is to use it within an information
extraction (IE) system. Information extraction
tasks involve processing an input text to fill slots
in an output template; the DARPA-sponsored Message
Understanding Conferences (MUCs) have evaluated IE
systems from several research sites. Some previous tasks
attempted at MUC involved extraction of information
about terrorist attacks (MUC-4); joint ventures (MUC-5);
and most recently, at MUC-6, management successions.
In this paper we concentrate on the management
succession task; figure 1 gives an example input-output
pair from the domain*.

(a) Who's News: Restor Industries Inc.

RESTOR  INDUSTRIES  Inc.  (Orlando,  Fla.)  — 
Hensley  E.  West,  50  years  old,  was  named 
president  of  this  telecommunications-product 
concern.  Mr.  West,  who  most  recently  was  a 
group  vice  president  for  DSC  Communications 
Corp.  in  Dallas,  fills  a  vacancy  created  by 
the  retirement  last  September  of  John 
Bradley,  63. 

(b)
Event Number   Slot      Filler
1              IN        Hensley E. West
               OUT       John Bradley
               POST      president
               COMPANY   RESTOR INDUSTRIES Inc.
2              OUT       Hensley E. West
               POST      group vice president
               COMPANY   DSC Communications Corp.

Figure 1: (a) A sample text from WSJ, involving management
successions. (b) The two succession events in
the text.

Most systems described in the MUC-6 proceedings
used the following three stages of processing in mapping
an input text to a set of output templates:

* We’ve just shown the most important, “core” slots for the task -
the MUC-6 specification includes additional information such as the
reason for the change, the title of the people involved etc.


1) Pattern matching at the sentence level. This is the
task that is approached in this paper. In the text in figure 1,
“Hensley E. West, 50 years old, was named president
of this telecommunications-product concern” would
be processed to give { IN = “Hensley E. West”, POST
= “president”, COMPANY = “this telecommunications-product
concern”, VERB = “named” } and “.... the retirement
last September of John Bradley ....” would give
{ OUT = “John Bradley”, NOUN = “retirement” }.

2) Coreference. Pronouns and definite NPs are
resolved to their antecedents. For example, “this
telecommunications-product concern” would be resolved
to “RESTOR INDUSTRIES Inc.”. This stage is
important because pronouns and definite noun-phrases
like “this telecommunications-product concern” are not
informative slot-fillers.

3) Merging. The information in a template may be
spread across several sentences. In the merging stage the
information from multiple mentions of the same event is
merged into a single template. In the example, the information
centered around “named” and “retirement” would
be identified as referring to the same event, and would
be combined to give { IN = “Hensley E. West”, OUT
= “John Bradley”, POST = “president”, COMPANY =
“RESTOR INDUSTRIES Inc.” }

This  paper’s  work  attacks  problem  (1)  alone,  and  is 
restricted  to  recovery  of  the  IN,  OUT  and  POST  slots. 

1.2  Previous  Work 

The majority of systems at MUC-6, including SRI’s system
FASTUS (Appelt et al. 93), and the best-performing
system, from NYU (Grishman 95), used cascaded finite-state
transducers, which were built by hand. The domain
independent transducers tokenize the text, recognise person
and company names, “chunk” noun and verb groups,
and finally build some higher level, complete clauses.
The domain specific rules then extract the slots from the
sentence, using patterns such as the example on page 244
of (Appelt et al. 93): “Company hires or recruits person
from company as position”.

There have been a number of machine learning approaches
to the sentence-level stage of information extraction.
The AutoSlog system (Riloff 93; Riloff 96)
automatically learned “concept node” definitions for use
on the MUC-4 terrorist events domain. A concept node
specifies a trigger word, usually a verb, and maps syntactic
roles with respect to this trigger to semantic slots
- for example, a concept node might specify if trigger
= “destroyed” and syntax = direct-object then concept
= Damaged-Object (Damaged-Object is the name of the
slot in this case). A concept node may also specify hard
or soft constraints on the slot-fillers. The system uses
the CIRCUS parser (Lehnert et al. 93) to find the syntactic
roles in relation to the trigger. AutoSlog learns
concept nodes given input-output pairs like those in figure 1,
so the indicator words do not need to be specified.
Experiments showed that running AutoSlog followed by
5 hours of filtering the rules by hand gave a system that
performed as well as a hand-crafted system.

The CRYSTAL system (Soderland et al. 95) also
learns rules that map syntactic frames to semantic roles.
The triggers can be more complicated than those in AutoSlog,
in that they can specify whole sequences of
words, or restrict patterns by specifying words or classes
in the surrounding context. CRYSTAL learns patterns
by initially specifying a maximally detailed pattern for
each training example, then progressively simplifying
and merging patterns until some error bound is exceeded.
CRYSTAL uses the BADGER sentence analyzer to give
syntactic information.

(Califf and Mooney 97) describe a system for extraction
of information about job postings from a newsgroup.
Relational learning is used to learn rule-based patterns
that specify: 1) a pre-filler pattern that matches the text
before the slot; 2) a pattern that must match the actual
slot filler; and 3) a post-filler pattern that matches the text
after the slot. The patterns can involve parts of speech,
semantic classes of words, or the words themselves. An
example pattern from (Califf and Mooney 97) for identifying
locations is pre-filler = in, filler = 2 or fewer words
all proper nouns, post-filler = word1 is “,”, word2 is a
state. This matches phrases like “in Kansas City, Missouri”
or “in Atlanta, Georgia”. The learning algorithm
starts with the most specific rule for each training example,
then generalizes by merging similar rules.

A major difference between the approaches described
in (Riloff 93; Riloff 96; Soderland et al. 95) and the approach
in this paper is that (Riloff 93; Riloff 96; Soderland
et al. 95) rely on a syntactic parser to produce at least
a shallow syntactic analysis. The approach described in
this paper builds a system from a set of training examples,
with only a part-of-speech tagger and a morphological
analyzer as additional resources. The system in
(Califf and Mooney 97) does not require a parser, but the
patterns it uses are quite local (the pre-filler and post-filler
patterns are adjacent to the slot). It isn’t clear this
method would work well for the management successions
domain where there are often many “noise” words
between the slots and the indicator. Another major difference
between the methods is that the PCFG based
method is probabilistic. This may be an advantage when
the sentence-level stage of processing is combined with
the later merging and coreference stages, as it gives a
principled way of combining evidence from the different
stages of processing; an uncertainty at the sentence
level may, for example, be resolved at the merging stage;
in this case it is useful for the sentence level system
to be capable of giving a list of candidate analyses with
associated probabilities.

2  Background 

2.1  The  Problem 

We  assume  the  following  definitions: 


1. A sentence, W, consists of n words, $w_1, w_2, \ldots, w_n$.

2.  A  template,  T,  is  a  non-empty  set  of  slots,  where 
each  slot  is  a  label  together  with  a  tuple  giving  the 
start  and  end  point  of  the  slot  in  the  sentence.  For 
example,  T  =  {IN  =  (3, 4), OUT  =  (5,6)}  means 
there  is  an  IN  slot  spanning  words  3  to  4  inclusive, 
and  an  OUT  slot  spanning  words  5  to  6  inclusive. 

In the management succession domain there are
three possible slots, IN, OUT and POST (abbreviated
to I, O and P respectively). IN is the string
denoting the person who is filling the post, OUT is
the person who is leaving the post, and POST is the
name of the post.

3.  In  addition,  we  assume  that  each  template  contains 
an  additional  indicator  slot,  which  is  the  verb  or 
noun  used  to  express  the  template. 

For example, a (W, T) pair might be

W = Last week Hensley West , 59 years old , joined
the company as president , a surprising development .

T = {IN = (3, 4), POST = (14, 14), IND = (10, 10)}

As alternative notation in this paper we either list the
strings in the template, for example T = {IN = “Hensley
West”, POST = “president”, IND = “joined”}, or we
show the (W, T) pair as a bracketed sentence:

Last week (IN Hensley West) , 59 years old ,
(IND joined) the company as (POST president)
, a surprising development .

Table 1 shows more examples from the management succession
domain. The machine learning task is to learn a
function that maps an arbitrary sentence W to a template
T, given a training set of N pairs $(W_i, T_i)$, $1 \le i \le N$.
A test set of (W, T) pairs is used to evaluate the model.
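The three notations above (spans, strings, bracketed sentence) carry the same information. The following is a minimal sketch of converting between them, assuming 1-indexed inclusive spans as in the definition of a template; the names `template_strings` and `bracketed` are illustrative and do not come from the system described here.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]        # (start, end), 1-indexed and inclusive, as in the text
Template = Dict[str, Span]    # slot label -> span


def template_strings(words: List[str], template: Template) -> Dict[str, str]:
    """Return the 'list of strings' notation for a template."""
    return {slot: " ".join(words[s - 1:e]) for slot, (s, e) in template.items()}


def bracketed(words: List[str], template: Template) -> str:
    """Return the bracketed-sentence notation for a (W, T) pair."""
    starts = {s: slot for slot, (s, _) in template.items()}
    ends = {e for _, e in template.values()}
    out = []
    for i, word in enumerate(words, start=1):
        token = word
        if i in starts:
            token = f"({starts[i]} {token}"   # open a slot before its first word
        if i in ends:
            token = f"{token})"               # close it after its last word
        out.append(token)
    return " ".join(out)


W = ("Last week Hensley West , 59 years old , joined the company "
     "as president , a surprising development .").split()
T = {"IN": (3, 4), "POST": (14, 14), "IND": (10, 10)}

print(template_strings(W, T))
print(bracketed(W, T))
```

The sketch assumes slots do not overlap, which matches the examples in table 1.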

In addition to a training set, we assume the following
resources:

1. A part of speech (POS) tagger. The POS tagger described
in (Ratnaparkhi 96) was used to tag both
training and test data.

2. A lexicon which maps each indicator word in training
data to a class, for example the morphological
variants “join”, “joins”, “joined” and “joining”
could all be mapped to the JOIN class. This can be
done automatically by a morphological analyzer as
in (Karp et al. 94), or by hand. (This resource is
not strictly necessary, but will help to reduce sparse
data problems).
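As a small illustration of the second resource, the lexicon can be thought of as a lookup table from surface forms to classes. The sketch below is a hand-built stand-in: the JOIN entries come from the example above, while the NAME class and the function name `indicator_class` are assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical hand-built lexicon; a morphological analyzer could produce
# the same mapping automatically.
INDICATOR_LEXICON: Dict[str, str] = {
    "join": "JOIN", "joins": "JOIN", "joined": "JOIN", "joining": "JOIN",
    # Assumed class name, by analogy with JOIN:
    "name": "NAME", "names": "NAME", "named": "NAME", "naming": "NAME",
}


def indicator_class(word: str) -> Optional[str]:
    """Return the class of an indicator word, or None if it is not in the lexicon."""
    return INDICATOR_LEXICON.get(word.lower())


print(indicator_class("Joined"))
print(indicator_class("president"))
```

Pooling the variants under one class means that counts for “joins” and “joined” are shared, which is the sparse-data benefit mentioned above.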

A probabilistic approach defines a conditional probability
$P(T \mid W)$ or a joint probability $P(T, W)$ for every
candidate template for a sentence. The most likely template
for a sentence W is then

$$T_{best} = \arg\max_T P(T \mid W) = \arg\max_T P(T, W) \qquad (1)$$

The two maximizations agree because $P(W)$ does not depend on T.
The major part of this paper will be concerned with formalizing
a stochastic model that defines $P(T, W)$.

3  A  Probabilistic  Model 

3.1  A  Naive  Approach  —  Finite  State  Tagging 
It is useful to note that a (W, T) pair can be represented
as a tagged sentence $w_1/t_1, w_2/t_2, \ldots, w_n/t_n$ where $T =
t_1, t_2, \ldots, t_n$ is the sequence of tags denoting the semantic
type for each word in the sentence. For example, the tags
could be I, O, P, IND, for the 3 slots and the indicator,
and N for other (noise) words, as in

Last/N week/N Hensley/I West/I ,/N 59/N
years/N old/N ,/N joined/IND the/N company/N
as/N president/P ,/N a/N surprising/N
development/N ./N
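The mapping from a span-based template to this word-by-word tagging is mechanical. A minimal sketch follows, assuming 1-indexed inclusive spans and the abbreviations I, O, P, IND and N used above; the helper names `to_tags` and `tagged` are illustrative.

```python
from typing import Dict, List, Tuple

Template = Dict[str, Tuple[int, int]]   # slot label -> (start, end), 1-indexed inclusive

# Abbreviations used in the tagged-sentence notation.
ABBREV = {"IN": "I", "OUT": "O", "POST": "P", "IND": "IND"}


def to_tags(n_words: int, template: Template) -> List[str]:
    """Give every word a tag: the slot abbreviation if covered, otherwise N (noise)."""
    tags = ["N"] * n_words
    for slot, (start, end) in template.items():
        for i in range(start - 1, end):
            tags[i] = ABBREV[slot]
    return tags


def tagged(words: List[str], template: Template) -> str:
    """Render a (W, T) pair in word/tag notation."""
    return " ".join(f"{w}/{t}" for w, t in zip(words, to_tags(len(words), template)))


W = ("Last week Hensley West , 59 years old , joined the company "
     "as president , a surprising development .").split()
T = {"IN": (3, 4), "POST": (14, 14), "IND": (10, 10)}

print(tagged(W, T))
```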

As a straw man we consider using a standard bigram
tagging model to tag test-set sentences. (Church 88) used
this to recover part-of-speech tags; a related approach described
in (Bikel et al. 97) gives a useful decomposition
of $P(T, W)$ into two terms:

$$P(T, W) = P(L_1, L_2, \ldots, L_m) \times \prod_{i=1 \ldots m} P(W_i \mid L_i) \qquad (2)$$

$\{L_1, L_2, \ldots, L_m\}$ is the underlying sequence of tags; in the
above example m = 7 and the sequence is {N, I, N, IND,
N, P, N}. $W_i$ is the string of words under label $L_i$, for
example $W_1$ = {Last, week}, $W_2$ = {Hensley, West}.
The two terms are then simplified, using bigram Markov
independence assumptions, to be

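$$P(L_1, L_2, \ldots, L_m) = P(L_1 \mid Start) \, P(End \mid L_m) \times \prod_{i=2 \ldots m} P(L_i \mid L_{i-1}) \qquad (3)$$

and (if label $L_i$ covers words $w_{s_i} \ldots w_{e_i}$)

$$P(W_i \mid L_i) = P(w_{s_i}, \ldots, w_{e_i} \mid L_i) = P(w_{s_i} \mid Start, L_i) \, P(End \mid w_{e_i}, L_i) \times \prod_{j=s_i+1 \ldots e_i} P(w_j \mid w_{j-1}, L_i) \qquad (4)$$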

This  finite  state  approach  has  been  highly  effective  for 
part  of  speech  tagging  (Church  88)  and  name  finding 
(Bikel  et  al.  97).  However,  the  next  section  considers 
the  characteristics  of  the  task  in  more  detail,  and  argues 
that  a  finite-state  tagger  is  a  poor  model  for  the  task. 
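To make the decomposition in equations (2) to (4) concrete, the sketch below scores a segmented sentence under the bigram model. It is an illustration only: the probability functions `p_label` and `p_word` are hypothetical callables standing in for estimated parameters, and the constant values used in the example are placeholders, not trained estimates.

```python
import math
from itertools import groupby
from typing import Callable, List, Tuple

START, END = "<Start>", "<End>"
Segment = Tuple[str, List[str]]   # (label L_i, words W_i under that label)


def segment(words: List[str], tags: List[str]) -> List[Segment]:
    """Collapse runs of identical tags into the label sequence L_1 ... L_m."""
    pairs = zip(tags, words)
    return [(label, [w for _, w in group])
            for label, group in groupby(pairs, key=lambda p: p[0])]


def log_joint(segments: List[Segment],
              p_label: Callable[[str, str], float],
              p_word: Callable[[str, str, str], float]) -> float:
    """log P(T, W) as in equation (2), using the bigram terms of (3) and (4)."""
    labels = [label for label, _ in segments]
    logp = 0.0
    # Equation (3): P(L_1 | Start) * prod_i P(L_i | L_{i-1}) * P(End | L_m)
    for prev, cur in zip([START] + labels, labels + [END]):
        logp += math.log(p_label(cur, prev))
    # Equation (4), once per segment: word bigrams conditioned on the label
    for label, words in segments:
        for prev, cur in zip([START] + words, words + [END]):
            logp += math.log(p_word(cur, prev, label))
    return logp


# Placeholder parameters: every label transition 0.5, every word transition 0.1.
toy = [("N", ["Last", "week"]), ("I", ["Hensley", "West"])]
print(log_joint(toy, lambda cur, prev: 0.5, lambda cur, prev, label: 0.1))
```

Note that `segment` merges adjacent words with the same tag, so two neighbouring fillers with the same label would be read as a single segment.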

3.2  More  about  the  task 

In  developing  an  intuition  for  the  task,  and  motivating 
the  choices  made  in  modeling  it,  it  is  useful  to  consider 
the  types  of  information  that  may  be  useful  to  a  system. 
Consider  the  following  5  points: 

1. There are 7 possible templates corresponding to the
7 non-empty subsets of {I, O, P}. The distribution
over these alternatives is by no means uniform; see
table 2 for the distribution.

2. The different slots tend to contain quite different
lexical items or strings - for example, the IN and
OUT slots are most likely to contain a proper name
or a personal pronoun, whereas the POST slot contains
strings such as “president”, “chairman” etc.

3. The choice of indicator word depends greatly on the
choice of template. For example “name” is very
likely to be used to express an event involving a
{I, P} template; “succeed” is very likely to express
an {I, O, P} or {I, O} template. See the final column
of table 2 for more examples.

4. The relative order of the slots and indicator in the
text varies considerably depending on the choice of
indicator. For example, given the template {I} and
the verb “join” the order is most likely to be { I
Indicator } (e.g. IN joined the company), whereas
given the verb “hire” the order is usually { Indicator
I } (e.g. the company hired IN).

5. In addition to the central indicator, there are often
secondary indicators - mainly prepositions - which
are strong signals of particular slots. For example,
given the verb is “named” or “succeeded”, the post
is very likely to be preceded by the preposition “as”
(e.g., the company named her as president, he succeeds
Jim Smith as president).

By considering points 1-5 we can see that the finite-state
tagging approach is deficient for the semantic tagging
task. The lexical probabilities in equation (4) are
probably sufficient to capture the lexical differences between
different states (the preference of the IN slot to
generate proper names, of the POST slot to generate
words like “president” and so on). But the Markov approximation
in equation (3) is deficient in many ways: it
fails to capture the non-uniform distribution over the 7
possible templates; worse still, it can
label more than one substring with the same slot label;
and it fails to capture the dependence of the slot order on the
indicator word, or the dependence between the template
and indicator.
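Two of these points are easy to check mechanically. The sketch below enumerates the 7 non-empty subsets of {I, O, P}, and flags tag sequences in which a slot label occurs in more than one separate run; a bigram model only conditions on the previous label, so nothing in equation (3) rules such sequences out. The helper names are illustrative.

```python
from itertools import combinations, groupby
from typing import FrozenSet, List, Set

SLOTS = ("I", "O", "P")


def nonempty_subsets(slots=SLOTS) -> List[FrozenSet[str]]:
    """All possible templates: the non-empty subsets of the slot labels."""
    return [frozenset(combo)
            for size in range(1, len(slots) + 1)
            for combo in combinations(slots, size)]


def repeated_slots(tags: List[str]) -> Set[str]:
    """Slot labels that occur in more than one separate run of a tag sequence."""
    runs = [label for label, _ in groupby(tags)]
    return {slot for slot in SLOTS if runs.count(slot) > 1}


print(len(nonempty_subsets()))
# A sequence a bigram tagger is free to produce, although no template allows it:
print(repeated_slots(["I", "N", "IND", "N", "I"]))
```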

3.3  A  Probabilistic  Context-Free  Grammar 

Our proposal is to replace the Markov assumption in
(3) with a probabilistic context-free grammar; that is, we
assume that the label sequence has been generated by
the application of r context-free rules $LHS_i \Rightarrow RHS_i$,
$1 \le i \le r$ (LHS stands for left hand side, RHS stands
for right hand side), and that

$$P(L_1 L_2 \ldots L_m) = \prod_{i=1 \ldots r} P(RHS_i \mid LHS_i). \qquad (5)$$

Each  LHS  is  a  single  non-terminal,  and  RHS  is  a  string 
of  one  or  more  non-terminals.  So  for  each  non-terminal 
LHS  in  the  grammar  there  is  a  distribution  over  possible 
RHSs  which  sums  to  one.  Counts  of  context-free  rules 
can  be  extracted  from  a  training  set  of  context  free  trees. 
Template     %age   Typical example              Most frequent indicator/percentage
{I, P}       45.6   IN was named as POST         name/67%
{O}          18.6   OUT retired                  retire/26%
{I, O}       16.6   IN succeeds OUT              succeed/98%
{I}           7.3   IN joined the company        join/31%
{O, P}        7.1   OUT resigned as POST         resign/46%
{I, O, P}     3.5   IN succeeded OUT as POST     succeed/100%
{P}           1.3   the company hired POST       -

Table 2: The distribution over the possible templates (the 7 non-empty subsets of {I, O, P}), and the most common
indicator for each template. For example, 45.6% of the templates are {I, P}, and in these 45.6% of the cases “name” is
chosen 67% of the time.


These counts are used to estimate the parameters $P(RHS_i \mid LHS_i)$.
Given a test data sentence, the most likely tree (and hence
the most likely template) can be recovered efficiently using
a variant of the CKY algorithm.
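The estimation step can be sketched as follows. This assumes plain relative-frequency estimates (rule count divided by the count of its left hand side); the text does not say whether any smoothing is applied, so that choice is an assumption. The toy rules and symbols S, A, B are placeholders, not rules from the grammar.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Rule = Tuple[str, Tuple[str, ...]]   # (LHS, RHS symbols)
Tree = List[Rule]                    # a tree, represented by the rules used in it


def estimate(trees: List[Tree]) -> Dict[Rule, float]:
    """Relative-frequency estimate of P(RHS | LHS) from a set of trees."""
    rule_counts = Counter(rule for tree in trees for rule in tree)
    lhs_counts: Counter = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}


def tree_log_prob(tree: Tree, probs: Dict[Rule, float]) -> float:
    """log of equation (5): the product of the probabilities of the rules in the tree."""
    return sum(math.log(probs[rule]) for rule in tree)


# Placeholder training trees over toy symbols.
trees = [
    [("S", ("A", "B")), ("A", ("a",))],
    [("S", ("A",)), ("A", ("a",))],
    [("S", ("A", "B")), ("A", ("a", "b"))],
]
probs = estimate(trees)
print(probs[("S", ("A", "B"))])
print(math.exp(tree_log_prob(trees[0], probs)))
```

By construction the estimates for each left hand side sum to one, as required of the distribution over right hand sides.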

4  The  Grammar 

This section describes the underlying context-free
structure¹ that we assume has generated the labels, and
motivates it in terms of the observations in section 3.2.
The context-free structure (the tree topology, and the
choice of non-terminal labels within the tree), is deterministically
derived from the initial labeling of the sentences,
so given a set of labeled sentences, the context-free
structures can be recovered and the parameters can
be estimated.

4.1  The  Leaf  Categories 

The tagging model as applied in the above example assumed
five tags - for the IN, OUT, and POST slots, the
indicator, and for noise (other words). In fact, we used
rather more categories, which are listed in table 3. These
labels can still be deterministically recovered from the labeled
sentence though, given the additional information
of a mapping from indicator words to their morphological
stem (for example, the mapping “joined” => JOIN).
The example sentence would have the following underlying
leaf labels:

[PREN Last week ] [I Hensley West ]
[NOISE+ , 59 years old , ] [IND(JOIN) joined ]
[NOISE- the company ] [P.Prep-(JOIN) as ]
[P president ] [POSTN , a surprising development . ]

4.2  The  Context-Free  Component  —  a  Brief  Sketch 
The PCFG model assumes the pre-terminal² label sequence
$\{L_1, L_2, \ldots, L_m\}$ has been generated by a stochastic
process with the following steps:

¹ The structures in this paper are non-recursive, and could, therefore,
be equivalently handled by a hierarchy of finite-state transducers,
or even a single equivalent non-deterministic finite-state automaton.
However, it is quite possible that extensions to the models could
require recursive structures.

² By pre-terminal, we mean a non-terminal that dominates words
rather than other non-terminals.


1.  Decide  whether  to  have  noise  words  (PREN)  before 
the  template  TEMP. 

2.  Decide  whether  to  have  noise  words  (POSTN)  after 
the  template  TEMP. 

3.  Decide  which  slots  to  have  (one  of  the  7  subsets  of 
{I,  O,  P}). 

4. Decide the class of indicator words.

5.  Decide  the  order  of  the  slots  and  indicator  word. 

6.  For  each  slot,  choose  whether  to  have  noise  between 
it  and  the  indicator  (NOISE+  or  NOISE-). 

7.  For  each  slot,  choose  whether  to  have  a  preposition 
directly  preceding  or  following  it. 

Figure  2  gives  an  example  tree  and  describes  the 
context-free  rules  within  it.  The  next  section  describes 
the  grammar  in  more  detail,  showing  how  these  7  types 
of  decision  can  be  encoded  as  context-free  rules. 

4.3  The  Context-Free  Component  in  Detail 
This section describes the top-down derivation of a sequence
of leaves within a PCFG framework.

Choosing  noise  at  the  start/end  of  the  sentence 
This level of the model chooses whether to have noise
preceding or following the text which expresses the succession
information.

TOP   -> PREN TEMP1    # there is noise at start
      -> TEMP1         # or there isn't

TEMP1 -> TEMP POSTN    # there is noise following
      -> TEMP          # or there isn't

The TEMP non-terminal covers the span of the succession
information, in the above example “Hensley ... president”.
P(PREN TEMP1 | TOP) can be interpreted as
the probability of having noise at the beginning of the
sentence; P(TEMP POSTN | TEMP1) is the probability
of having post-noise.

P(Slots)

TEMP first re-writes in one of seven ways, corresponding
to the 7 possible templates. The T.X non-terminal (X being
one of the seven slot subsets) encodes the slots that will be
generated below it, for example T.IO would generate an IN
and an OUT slot below it. So P(RHS | TEMP) will mirror
the distribution in column 2 of table 2.
Leaf label        Description

I, O, P           The IN, OUT and POST non-terminals
PREN              “Noise” words before the template
POSTN             “Noise” words after the template
NOISE+            “Noise” between slots and the indicator, which comes before the indicator
NOISE-            “Noise” between slots and the indicator, which comes after the indicator
IND(class)        Leaf dominating the indicator; class can be any one of the morphological
                  stems seen in training data. For example, IND(join) could dominate
                  “join”, “joins”, “joined” or “joining”.
I.Prep-(class)    Prepositions for the I, O and P slots, for a particular class, and which
O.Prep-(class)    follow the indicator. For example, P.Prep-(join) would be a preposition
P.Prep-(class)    for the POST slot, with an indicator in the join class, and would most
                  likely be “as”.
I.Prep+(class)    Prepositions for the I, O and P slots, for a particular class, and which
O.Prep+(class)    precede the indicator.
P.Prep+(class)

Table 3: The pre-terminal labels that are used in the system.


[Tree diagram rooted at TOP]

Rule                                      Interpretation
TOP => PREN TEMP1                         Choose to have pre-noise (PREN)
TEMP1 => TEMP POSTN                       Choose to have post-noise (POSTN)
TEMP => T.IP                              Choose to have IN and POST slots
T.IP => IND.IP(JOIN)                      Choose to use a member of the JOIN class of indicators
IND.IP(JOIN) => IND.I(JOIN) P2-(JOIN)     Generate the POST slot to the right of the indicator
IND.I(JOIN) => I2+(JOIN) IND(JOIN)        Generate the IN slot to the left of the indicator
I2+(JOIN) => I1+(JOIN) NOISE+             Have noise between the IN slot and the indicator
I1+(JOIN) => I                            Choose not to have a preposition for the IN slot
P2-(JOIN) => NOISE- P1-(JOIN)             Have noise between the POST slot and the indicator
P1-(JOIN) => P.Prep-(JOIN) P              Choose to have a preposition attached to the POST slot

Figure 2: An example context-free tree. The table shows the interpretation of each of the rules in the tree.


TEMP -> T.IOP
TEMP -> T.IO
TEMP -> T.IP
TEMP -> T.OP
TEMP -> T.I
TEMP -> T.O
TEMP -> T.P
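As an illustration of how these seven rules carry the template distribution, the sketch below attaches a probability to each TEMP rule using the percentages reported in table 2 as stand-ins. In the actual model the values would be estimated from rule counts in the training trees, so the numbers and the function name `temp_rules` are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple

Rule = Tuple[str, Tuple[str, ...]]   # (LHS, RHS symbols)

# Percentages from table 2, keyed by the slots in the template.
TABLE2_PERCENT: Dict[str, float] = {
    "IP": 45.6, "O": 18.6, "IO": 16.6, "I": 7.3, "OP": 7.1, "IOP": 3.5, "P": 1.3,
}


def temp_rules(percentages: Dict[str, float]) -> Dict[Rule, float]:
    """Build the TEMP -> T.X rules, normalizing so the probabilities sum to one."""
    total = sum(percentages.values())
    return {("TEMP", (f"T.{slots}",)): pct / total
            for slots, pct in percentages.items()}


rules = temp_rules(TABLE2_PERCENT)
for (lhs, rhs), prob in sorted(rules.items(), key=lambda item: -item[1]):
    print(f"{lhs} -> {rhs[0]:<6} {prob:.3f}")
```

This is the piece of information the bigram model in equation (3) could not represent directly: a single multinomial over the 7 templates.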


P(Class | Slots)

The next step is to choose the Class of indicator that
is used to express the transaction. Each Class is a set
of words with the same morphological stem, for example
the JOIN class would include join, joins, joined and
joining. P(Class | Slots) is implemented in the CFG
fragment shown below. The IND non-terminal encodes
which slots need to be generated, and the Class used to
express the transaction. Each T.X rule (X being the chosen
slot subset) can re-write in N ways, where N is the number
of classes.

T.IOP -> IND.IOP[Class]     T.I -> IND.I[Class]
T.IO  -> IND.IO[Class]      T.O -> IND.O[Class]
T.IP  -> IND.IP[Class]      T.P -> IND.P[Class]
T.OP  -> IND.OP[Class]

P(Order | Class, Slots)

Having chosen the Slots to be generated, and the Class
used to express the event, there are many possible orders
in which the slots and class can appear. In the above example
(Slots = {I, P}, Class = JOIN) there are 6 permutations
({JOIN I P}, {I JOIN P}, {I P JOIN} and so on). It
is necessary to estimate a distribution over these alternatives.
The order is parameterized using a binary branching,
context-free fragment: part of this (all rules with
LHS = IND.IP[Class]) is shown below. The full
grammar specifies similar rules for all IND.X[Class]
where X is any one of the non-empty subsets of {I, O, P}.

IND.IP[Class] -> IND.I[Class] P2-[Class]
IND.IP[Class] -> P2+[Class] IND.I[Class]
IND.IP[Class] -> IND.P[Class] I2-[Class]
IND.IP[Class] -> I2+[Class] IND.P[Class]
The notation is:

•  IND keeps track of which slots still need to be
generated. For example IND.IP[Class] means
that the IN and POST slots need to be generated.

•  The I2, O2, and P2 non-terminals will eventually
generate the IN, OUT and POST leaves. The “2”
stands for level 2 - more in the next section on why
this is necessary. “+” means the slot appears before
the head-word, “-” means it appears after. The
Class is propagated to the I2, O2 and P2
non-terminals. Propagation of the Class and direction
(+ or -) is important because the identity of any
prepositions is conditioned on this information.

Each  binary  rule  expresses  a  choice  of  which  of  the 
remaining  slots  to  generate  next,  and  which  direction  to 
generate  it  in.  So  IND.IP[Class]  can  re-write  in  4  ways: 
either  the  IN  or  POST  slot  can  be  generated  either  to  the 
left  or  right  of  the  head-word  itself. 
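
The counting argument above can be checked with a short Python sketch. The function names `ordering_rules` and `surface_orders`, and the encoding of non-terminals as strings, are assumptions made for illustration only; the rewrite to a bare IND[Class] once the last slot has been generated follows the rule IND.I(JOIN) => I2+(JOIN) IND(JOIN) in Figure 2.

```python
def ordering_rules(slots, cls="Class"):
    """Binary rewrites of IND.<slots>[cls]: pick a remaining slot and a side."""
    lhs = f"IND.{slots}[{cls}]"
    rules = []
    for slot in slots:
        rest = slots.replace(slot, "")
        # Once no slots remain, only the indicator itself is left.
        head = f"IND.{rest}[{cls}]" if rest else f"IND[{cls}]"
        rules.append((lhs, (head, f"{slot}2-[{cls}]")))  # slot to the right
        rules.append((lhs, (f"{slot}2+[{cls}]", head)))  # slot to the left
    return rules


def surface_orders(slots):
    """All left-to-right orders of slots and indicator the rules can derive."""
    if not slots:
        return {("IND",)}
    orders = set()
    for slot in slots:
        for inner in surface_orders(slots.replace(slot, "")):
            orders.add(inner + (slot,))   # slot generated after the head
            orders.add((slot,) + inner)   # slot generated before the head
    return orders


for lhs, rhs in ordering_rules("IP"):
    print(lhs, "->", " ".join(rhs))
print(len(surface_orders("IP")), "surface orders for slots I and P")
```

For the slots {I, P} the sketch lists the four rewrites shown in the fragment above, and the derivable surface orders are exactly the 6 permutations of the two slots and the indicator.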

Choosing  to  generate  noise  between  the  slots 
Noise  can  appear  after  any  slot  preceding  the  indicator, 
or  before  any  slot  following  the  indicator.  The  CFG  rules 
below  encode  the  decision  to  have  noise  in  a  gap  or  not, 
for  an  IN  slot  generated  before  or  after  the  indicator.  The 
rules  for  OUT  and  POST  are  similar. 

I2+[Class] -> I1+[Class] NOISE+

I2+[Class] -> I1+[Class]

I2-[Class] -> NOISE- I1-[Class]

I2-[Class] -> I1-[Class]

Choosing  to  generate  a  preposition  (or  other 
indicator)  linked  to  a  slot 

Any  of  the  slots  can  have  an  adjacent  “indicator”,  usu¬ 
ally  a  preposition.  The  rules  below  encode  the  binary 
decision  of  whether  to  include  an  indicator  for  an  IN  slot 
-  the  OUT  and  POST  cases  are  similar. 

I1+[Class] -> I I.Prep+[Class]

I1+[Class] -> I

I1-[Class] -> I.Prep-[Class] I

I1-[Class] -> I

Again, for each I1, O1 or P1 non-terminal there are
two possible re-writes, one binary, one unary, encoding
whether or not to generate a preposition. The I.Prep,
O.Prep and P.Prep non-terminals then generate the
indicator with a bigram model. The non-terminal encodes
whether the slot appears before or after the head-word
(“+” or “-”) and the Class of the head-word.
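
Since the OUT and POST rules mirror the IN rules, the level-2 (noise) and level-1 (preposition) rewrites for any slot can be generated mechanically. The following Python sketch does this; the function name `slot_rules` and the tuple encoding of rules are illustrative assumptions, not part of the system described here.

```python
def slot_rules(slot, cls="Class"):
    """Level-2 and level-1 rewrites for one slot (I, O or P), both directions."""
    rules = []
    for d in "+-":
        level2 = f"{slot}2{d}[{cls}]"
        level1 = f"{slot}1{d}[{cls}]"
        prep = f"{slot}.Prep{d}[{cls}]"
        if d == "+":
            # Slot precedes the indicator: noise and preposition follow the slot.
            rules += [(level2, (level1, "NOISE+")), (level2, (level1,)),
                      (level1, (slot, prep)), (level1, (slot,))]
        else:
            # Slot follows the indicator: noise and preposition precede the slot.
            rules += [(level2, ("NOISE-", level1)), (level2, (level1,)),
                      (level1, (prep, slot)), (level1, (slot,))]
    return rules


for lhs, rhs in slot_rules("P", "JOIN"):
    print(lhs, "->", " ".join(rhs))
```

Each slot contributes eight rewrites per class: a binary and a unary choice at each of the two levels, in each of the two directions.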

5  Training the Model

There  are  two  steps  to  training  the  model:  first,  recov¬ 
ering  the  underlying  tree  structure  from  the  training  data 
labels;  second,  deriving  counts  of  the  CF  rule  applica¬ 
tions  and  bigram  sequences  and  using  these  to  estimate 
the  parameters  of  the  model. 


5.1  Deriving  the  Tree  Structures  in  Training  Data 
While  the  tree  structure  described  in  section  4.3  may 
seem  complex,  it  is  important  to  realise  that  it  can  be 
deterministically  derived  from  an  annotator’s  labeling  of 
the  Slots  and  Indicator.  This  section  describes  how  the 
structure  is  derived  in  a  bottom  up  fashion  using  the  fol¬ 
lowing  annotated  sentence  as  example  input  to  the  pro¬ 
cess: 

Last  week  [I  Hensley  West  ]  ,  59  years  old  , 

[IND  joined  ]  the  company  as  [P  president  ] ,  a 
surprising  development . 

The  6  stages  are  as  follows: 


1. Identify the class of the indicator, and add this
information to the IND label. Mark any prepositions
adjacent to the slots. Label “noise” words with either
PREN, POSTN, NOISE+ or NOISE-. The output
from this stage would be:

[PREN Last week ] [I Hensley West ] [NOISE+ , 59
years old , ] [IND(JOIN) joined ] [NOISE- the company ]
[P.Prep-(JOIN) as ] [P president ] [POSTN ,
a surprising development . ]

2. Build level 1 of the slots, by including attached
prepositions, or just building a unary rule.

[Figure: the level-1 subtrees built over the IN and POST slots.]


3. Build level 2 of the slots, by attaching NOISE+ or
NOISE- leaves to the slots.

[Figure: the level-2 subtrees, with NOISE+ “, 59 years old ,”
attached to the IN slot and NOISE- “the company” attached to
the POST slot.]


4. Build the binary-branching context-free structure
that defines the order of the slots and indicator.

[Figure: the binary-branching structure combining the IN slot,
IND(JOIN) and P2-(JOIN).]


5. Add the top level of the tree.

[Figure: the complete tree, in which TEMP rewrites as T.IP and
then IND.IP(JOIN), and POSTN covers “, a surprising development .”]
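
Stage 1 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the annotation is given as a list of (label, words) segments with unlabelled gaps marked `None`, the indicator class is passed in rather than found by a morphological analyzer, and prepositions are detected with a small hypothetical word list (`PREPOSITIONS`) rather than a part-of-speech tagger.

```python
# Hypothetical stand-in for part-of-speech information.
PREPOSITIONS = {"as", "of", "by", "to", "from"}


def label_segments(segments, cls):
    """Label gaps as PREN, POSTN, NOISE+ or NOISE-, and mark adjacent prepositions."""
    ind_pos = next(i for i, (lab, _) in enumerate(segments) if lab == "IND")
    labelled = [i for i, (lab, _) in enumerate(segments) if lab is not None]
    first, last = labelled[0], labelled[-1]
    out = []
    for i, (lab, words) in enumerate(segments):
        if lab == "IND":
            out.append((f"IND({cls})", words))
        elif lab is not None:
            out.append((lab, words))
        elif i < first:
            out.append(("PREN", words))
        elif i > last:
            out.append(("POSTN", words))
        elif i < ind_pos:
            # Gap after a slot that precedes the indicator.
            slot = segments[i - 1][0]
            if words[0].lower() in PREPOSITIONS:
                out.append((f"{slot}.Prep+({cls})", words[:1]))
                words = words[1:]
            if words:
                out.append(("NOISE+", words))
        else:
            # Gap before a slot that follows the indicator.
            slot = segments[i + 1][0]
            prep = None
            if words[-1].lower() in PREPOSITIONS:
                prep, words = words[-1:], words[:-1]
            if words:
                out.append(("NOISE-", words))
            if prep:
                out.append((f"{slot}.Prep-({cls})", prep))
    return out


segments = [
    (None, ["Last", "week"]),
    ("I", ["Hensley", "West"]),
    (None, [",", "59", "years", "old", ","]),
    ("IND", ["joined"]),
    (None, ["the", "company", "as"]),
    ("P", ["president"]),
    (None, [",", "a", "surprising", "development", "."]),
]
for label, words in label_segments(segments, "JOIN"):
    print(f"[{label} {' '.join(words)} ]")
```

On the example sentence the sketch reproduces the labelled sequence given for stage 1; the later stages then build the tree bottom-up from these labels.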


5.2  Context  Free  Rule  Probabilities 
Once  the  training  data  is  processed  to  have  full  context- 
free  trees,  the  grammar  can  be  automatically  read  from 
these  trees,  and  event  counts  can  be  extracted  and  used 
to  estimate  the  parameters  of  the  model.  The  maximum 
likelihood  estimate  for  a  CF  rule  LHS  ->  RHS  is 


$$P(RHS \mid LHS) = \frac{C(RHS, LHS)}{C(LHS)}$$

where C(x) is the number of times event x has been seen
in training data. This estimate can be unreliable,
particularly for low values of C(LHS). So we smooth this
estimate with a “backed off” estimate $P_b$:

$$P(RHS \mid LHS) = \lambda \frac{C(RHS, LHS)}{C(LHS)} + (1 - \lambda) P_b$$

where $0 < \lambda < 1$. The backed off estimate $P_b$ is
based on a subset of the context, so the estimate is
more robust but is less detailed. For example,
$P_b$ for P(T.IOP -> IND.IOP[Class] | T.IOP) might be
P(T -> IND[Class] | T), i.e. an estimate that ignores
the slots when choosing the class of indicators. This
method borrows heavily from smoothing techniques in
language modeling for speech recognition. (Jelinek
90) describes methods for estimating $\lambda$.
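
As an illustration of the interpolation, the following Python sketch computes the smoothed rule probability from raw counts. The function name, the toy counts, the value of the interpolation weight and the backed-off probability are all assumptions chosen for the example; falling back entirely on the backed-off estimate when the left-hand side is unseen is likewise an assumption of the sketch.

```python
from collections import Counter


def smoothed_rule_prob(lhs, rhs, rule_counts, lhs_counts, backoff_prob, lam):
    """lam * C(RHS, LHS) / C(LHS) + (1 - lam) * backoff_prob."""
    c_lhs = lhs_counts[lhs]
    if c_lhs == 0:
        # Assumption: with no evidence for this LHS, rely on the back-off alone.
        return backoff_prob
    mle = rule_counts[(lhs, rhs)] / c_lhs
    return lam * mle + (1 - lam) * backoff_prob


# Toy counts, invented for illustration only.
rule_counts = Counter({("T.IP", "IND.IP[JOIN]"): 3, ("T.IP", "IND.IP[NAME]"): 1})
lhs_counts = Counter({"T.IP": 4})

p = smoothed_rule_prob("T.IP", "IND.IP[JOIN]", rule_counts, lhs_counts,
                       backoff_prob=0.5, lam=0.8)
print(round(p, 3))  # 0.8 * 0.75 + 0.2 * 0.5 = 0.7
```

A rule never seen with this left-hand side still receives (1 - lam) times the backed-off probability, which is the point of the smoothing.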

5.3  Bigram Probabilities

The bigram model is used at the leaves of the tree
to generate the words themselves, for example to
estimate P(the president | P). The most obvious
way to estimate this is as P(the | START, P)
* P(president | the, P) * P(END | president, P), with
smoothing being implemented by interpolation between
P(w | w_{-1}, State), P(w | State) and 1/V, where V is the
vocabulary size. Unfortunately we do not have space to
go into the full details of the smoothing here (in the
final implementation part-of-speech information was also
used to smooth the estimates).
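
A minimal Python sketch of such a state-conditioned bigram model is given below. Because the full smoothing details are not given here, the fixed interpolation weights, the class name `StateBigram` and the vocabulary size are assumptions for illustration; the part-of-speech smoothing mentioned above is not modeled.

```python
from collections import Counter


class StateBigram:
    """Bigram model over the words generated by one pre-terminal state."""

    def __init__(self, vocab_size, weights=(0.6, 0.3, 0.1)):
        self.vocab_size = vocab_size
        self.weights = weights          # assumed fixed interpolation weights
        self.bigrams = Counter()        # (state, prev, word)
        self.contexts = Counter()       # (state, prev)
        self.unigrams = Counter()       # (state, word)
        self.totals = Counter()         # state

    def observe(self, state, words):
        seq = ["START", *words, "END"]
        for prev, word in zip(seq, seq[1:]):
            self.bigrams[state, prev, word] += 1
            self.contexts[state, prev] += 1
            self.unigrams[state, word] += 1
            self.totals[state] += 1

    def word_prob(self, word, prev, state):
        l1, l2, l3 = self.weights
        c, t = self.contexts[state, prev], self.totals[state]
        p_bigram = self.bigrams[state, prev, word] / c if c else 0.0
        p_unigram = self.unigrams[state, word] / t if t else 0.0
        # Interpolate P(w | prev, state), P(w | state) and the uniform 1/V.
        return l1 * p_bigram + l2 * p_unigram + l3 / self.vocab_size

    def sequence_prob(self, words, state):
        seq = ["START", *words, "END"]
        prob = 1.0
        for prev, word in zip(seq, seq[1:]):
            prob *= self.word_prob(word, prev, state)
        return prob


model = StateBigram(vocab_size=1000)
model.observe("P", ["the", "president"])
print(model.sequence_prob(["the", "president"], "P"))
print(model.sequence_prob(["president", "the"], "P"))
```

The uniform 1/V term guarantees that an unseen word sequence still receives a non-zero probability, while the observed order scores far higher.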

6  Experiments 

This  section  describes  experiments  on  the  management 
successions  domain.  Before  giving  the  results,  we  dis¬ 
cuss  how  to  deal  with  sentences  that  have  more  than  one 
indicator. 

6.1  Dealing  with  Sentences  that  have  more  than 
one  Indicator 

Thus  far  the  model  has  assumed  that  there  is  only  one 
indicator  per  sentence.  However,  training  data  frequently 
has  more  than  one  indicator,  as  in 

Mr.  Smith  was  named  president  of  the  com¬ 
pany,  succeeding  Fred  Jones. 

There  are  two  events  in  this  sentence,  one  centered 
around  named,  the  other  centered  around  succeeding. 
The solution is to transform sentences in both training
and test data to give one sentence per indicator; in this
case the sentence would be expanded to give two
sentences:

Mr. Smith was *named* president of the company,
succeeding Fred Jones.

Mr. Smith was named president of the company,
*succeeding* Fred Jones.

The first sentence is for the named event, the second is
for succeeding. The indicator is replaced with *indicator*
to show that it is the one of interest: when decoding
test data the model either recognises *named* as a
potential indicator and ignores succeeding, or ignores named
and recognises *succeeding*. If the sentence appeared in
training data it would be transformed to give two
training data trees. We should stress that this process is
completely automatic once the indicators have been identified
in the text.
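
The expansion is simple enough to sketch in a few lines of Python. The function name and the representation of a sentence as a token list with known indicator positions are assumptions made for illustration.

```python
def one_sentence_per_indicator(tokens, indicator_positions):
    """Return one copy of the sentence per indicator, with that indicator starred."""
    expanded = []
    for pos in indicator_positions:
        marked = list(tokens)
        marked[pos] = f"*{tokens[pos]}*"
        expanded.append(marked)
    return expanded


tokens = ("Mr. Smith was named president of the company , "
          "succeeding Fred Jones .").split()
for sentence in one_sentence_per_indicator(tokens, [3, 9]):
    print(" ".join(sentence))
```

Each copy marks exactly one indicator, so the single-indicator assumption of the model holds for every sentence passed to training or decoding.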


6.2  Results 

The  model  was  trained  on  563  sentences,  and  tested  on 
another  356  sentences.  (That  is,  563/356  sentences  af¬ 
ter  producing  one  sentence  per  indicator  as  described  in 
section 6.1). The sentences were taken from the “Who’s
News” section of the Wall Street Journal, which is almost
exclusively about management successions. The training
sentences  were  taken  from  219  Who’s  News  articles  in 
the  1996  section,  the  test  sentences  were  taken  from  131 
articles  in  the  1995  section.  The  sentence  level  annota¬ 
tion  was  part  of  an  annotation  effort  for  the  full  extraction 
task,  which  therefore  also  marked  the  relevant  corefer¬ 
ence  relationships  and  the  complete  output  template  as 
in  figure  1. 

The  test  data  sentences  always  contain  an  event,  and 
have  all  indicators  marked  as  *indicator*  —  only  those 
indicators  that  have  1  or  more  slots  attached  to  them  are 
marked.  This  is  an  idealization,  in  that  we  avoid  prob¬ 
lems  of  false  positives,  cases  where  a  potential  indicator 
is  not  used  to  express  an  event.  See  section  6.4  for  sug¬ 
gestions  about  how  to  extend  the  model  to  deal  with  false 
positives. 

The  results  are  shown  in  table  4.  We  define  precision 
and  recall  when  comparing  to  the  annotated  test  set  an¬ 
swers  (gold  standard)  as 


$$\text{Precision} = \frac{\text{Number of correct slots}}{\text{Number of slots proposed}}$$

$$\text{Recall} = \frac{\text{Number of correct slots}}{\text{Number of slots in the gold standard}}$$


In  addition  we  report  the  standard  “F-Measure”,  which  is 
a  combination  of  precision  and  recall 


$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$


The  results  are  quoted  for  the  IN,  OUT  and  POST  slots 
(the  IND  slot  is  not  scored,  as  it  is  marked  in  test  data  and 
would  score  100%  recall/precision,  inflating  the  scores). 
The  number  of  “correct”  slots  varies  depending  on  how 
partial  matches  are  scored  -  a  partial  match  is  where  an 
output  slot  does  not  match  a  gold  standard  slot  exactly, 
but does partially overlap. For example, in

Bill Smith was elected vice president, human
resources.

the gold standard might designate the slot as “vice
president, human resources”, whereas the program output
might just mark “vice president”. We present three
precision/recall scores, where a partial match scores 0, 0.5
or 1.0, and

Number of correct slots = Number of exact matches
+ Score for a partial match x Number of partial matches

Score for partial    Precision    Recall    F-Measure
0                    80.6%        74.6%     77.5%
0.5                  85.9%        79.6%     82.6%
1.0                  91.3%        84.5%     87.8%

Table 4: Results on 356 test data sentences, training on
563 sentences

6.3  Analysis  of  the  results 

In  this  section  we  look  at  the  errors  the  system  makes 
in  more  detail.  There  are  two  categories  of  error:  preci¬ 
sion  errors  (incorrect  slots);  and  recall  errors  (slots  the 
system  failed  to  propose).  For  these  tests  we  ran  ex¬ 
periments  on  the  training  data,  jack-knifing  (i.e.  using 
cross-validation)  it  into  4  sections,  in  each  case  training 
on  three-quarters  of  the  training  set  and  testing  on  the 
other  quarter.  Tables  5  and  6  show  the  results  on  this 
data  set. 


Gold    Proposed    Correct    Partial    Correct+Partial
986     944         769        71         840

Table 5: Results on the jack-knifed training set (Counts)


Score for partial    Precision    Recall    F-Measure
0                    81.5%        78.0%     79.7%
0.5                  85.2%        81.6%     83.4%
1.0                  89.0%        85.2%     87.0%

Table 6: Results on the jack-knifed training set
(Percentages)
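
The percentages in Table 6 follow directly from the counts in Table 5 and the definitions in section 6.2. The short Python sketch below makes the arithmetic explicit; the function name `score` is an assumption for illustration.

```python
def score(gold, proposed, exact, partial, partial_credit):
    """Precision, recall and F-measure with a given credit for partial matches."""
    correct = exact + partial_credit * partial
    precision = correct / proposed
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from Table 5: gold, proposed, exact matches, partial matches.
for credit in (0, 0.5, 1.0):
    p, r, f = score(986, 944, 769, 71, credit)
    print(f"{credit:<4} {100 * p:.1f}%  {100 * r:.1f}%  {100 * f:.1f}%")
```

Run on the Table 5 counts, this gives the three rows of Table 6.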


6.3.1  Precision  errors 

Table 7 shows the 104 precision errors categorized by
hand into four categories. These four categories were:

1) Semantically Plausible. Here the model has
selected a slot-filler that looks good semantically, but is
ruled out for other reasons (usually syntactic). For
example,

The  appointment  puts  (IN  Mr.  Zwim)  ,  41 
years  old  ,  in  line  to  succeed  the  unit ’s  pres¬ 
ident  ,  Frank  R.  Bakos ,  58 ,  who  is  (IND  retir¬ 
ing)  at  year  end . 

Here  “Mr.  Zwim”  is  semantically  a  good  filler  for  “retir¬ 
ing”,  but  syntactically  this  is  almost  impossible. 


Error type                %age     Error sub-type      %age
Semantically Plausible    37.1%    Relative clauses    18.1%
                                   Subject             10.5%
                                   Other                8.6%
“Correct”                 25.7%    Good alternative    17.1%
                                   > 1 reference        8.6%
Bad lexical information    8.6%
Others                    28.6%

Table 7: The percentage of errors in each error category

We sub-divided this class into 3 sub-categories:
problems with relative clauses, as in the example above;
problems with non-relativized subjects, for example
“Brandon Sweitzer , 53 , succeeds (IN Mr. Wakefield) as
president of Guy Carpenter and also (IND becomes) (POST
the unit ’s CEO) , succeeding Richard Blum , 56”; and
problems that fell outside these categories.

2)  “Correct”.  These  slots  were  not  seen  in  the  gold- 
standard,  but  were  deemed  pretty  much  correct,  in  that 
they  would  not  hurt  (and  might  even  help)  the  score  of  a 
full  system.  They  fall  into  two  sub-categories  -  “good  al¬ 
ternative”,  where  the  model’s  output  is  different  from  the 
gold  standard  but  still  looks  reasonable,  either  because 
the  sentence  has  more  than  one  reasonable  answer,  or  the 
gold  standard  is  simply  wrong;  “>  1  reference”,  where 
there  is  more  than  one  reference  to  the  slot  filler  in  the 
sentence,  and  the  model  has  chosen  a  different  one  from 
the  gold  standard.  For  example, 

(OUT  Mr.  Johnson)  ,  52  ,  said  he  resigned 
(POST  his  positions  as  chief  executive  officer) 

Here the model marked “Mr. Johnson” as OUT, the
annotator marked “he”, and both are in some sense correct.

3)  Bad  Lexical  Information.  In  these  cases  the  model 
selected  a  slot  filler  that  is  clearly  bad  for  lexical  reasons, 
for  example 

Mr.  Broeksmit  is  the  (OUT  latest)  in  a  string 
of  employees  to  (IND  leave)  the  firm  ... 

4)  Other.  Miscellaneous  errors  which  do  not  fall  into 
the  above  three  categories. 

6.3.2  Recall  Problems 

Of  the  356  test-set  sentences,  330  (92.7%)  were  pro¬ 
cessed  by  the  system  to  give  some  output  —  no  output 
was  produced  for  26  cases.  This  accounts  for  the  recall 
figures  in  table  4  being  lower  than  the  precision  figures 
(for  example,  with  a  score  of  0  for  partials,  precision  = 
80.6%,  recall  =  74.6%,  and  92.7%*80.6%  =  74.7%).  Of 
these  26  cases,  24  involved  an  indicator  word  that  had 
never  been  seen  in  training  data.  The  other  2  cases  in¬ 
volved  an  unusual  usage  of  “succeed”,  which  had  never 
been  seen  in  training  data  and  was  peculiar  enough  for 
the  system  to  fail  to  get  an  analysis  (we  set  a  probability 
threshold such that the machine gives up if it fails to find
an analysis above this probability).

6.4  Dealing with False Positives
This  work  has  made  a  simplifying  assumption,  that  test 
sentences  were  marked  with  indicators  that  had  one  or 
more  slots.  This  section  considers  how  this  process  could 
be  automated. 

A  first  step  would  be  to  identify  in  test  data  morpho¬ 
logical  variants  of  words  that  had  been  seen  as  indicators 
in  training  data.  However  this  would  inevitably  lead  to 
false  positives  —  that  is,  potential  indicators  appearing 
in  cases  where  they  don’t  indicate  an  event.  We  could  see 
two  potential  approaches  for  filtering  out  these  spurious 
cases:  first,  word-sense  disambiguation  methods  similar 
to  those  in  (Yarowsky  95);  second,  we  could  extend  the 
model  to  have  an  eighth,  empty,  template  as  a  possibility 
—  the  model  should  then  learn  how  often  null  templates 
occur,  and  what  kind  of  lexical  items  tend  to  produce 
them. 

We  leave  this  to  future  work.  At  least  in  this  dataset 
(Who’s  News  articles)  we  believe  that  the  false  positive 
problem  will  not  be  severe,  as  the  articles  contain  infor¬ 
mation  almost  exclusively  on  management  successions, 
and  most  of  the  indicators  are  unambiguous  within  this 
sub-domain. 

The models have also made the assumption that an
indicator is used to express each event. This may not be
the case in all information extraction tasks; in some there
may not be clear indicator words. Again, we leave dealing
with this limitation to future work.

7  Future  Work 

We  anticipate  two  directions  for  future  work:  first,  re¬ 
fining  the  current  model  to  improve  its  performance,  and 
second,  extending  the  current  model  to  encompass  the 
complete  information  extraction  task. 

7.1  Refining  the  Model 

When deciding on the direction of future work, it is
useful to consider the error analysis in table 7. The largest
class of errors (the “semantically plausible” class, 37.1%)
were cases where the model picked a slot that was
semantically plausible, but syntactically impossible. It is unlikely that this
problem  can  be  solved  with  the  approach  described  here, 
even  with  vastly  increased  amounts  of  training  data.  Our 
feeling  is  that  a  full  syntactic  parser  as  a  first  stage  could 
radically  improve  performance.  An  improved  approach 
might  be  to  fully  integrate  the  recovery  of  syntactic  struc¬ 
ture  and  semantic  labelings,  in  a  similar  way  to  the  ap¬ 
proach  used  in  BBN’s  SIFT  system  (Miller  et  al.  98). 

7.2  Extending  the  Model 

As  discussed  in  section  1.1,  the  standard  approach  to 
information  extraction  involves  three  stages  of  process¬ 
ing:  sentence  level  pattern  matching,  coreference,  and 
template merging. Of these stages, our current work
addresses only sentence level pattern matching. However,
we believe that the generative statistical framework
described in this paper could be extended advantageously
to the complete information extraction problem. In
extending the framework, we envision that the information
extraction task would be performed using an inverted
“information production” model.

We  can  think  of  this  model  as  approximating,  to  some 
degree,  the  process  by  which  text  is  produced  by  an  au¬ 
thor.  Specifically,  we  assume  that  each  message  is  pro¬ 
duced  according  to  a  four  stage  process: 

1)  First,  the  author  decides  what  facts  to  express.  For 
example,  the  text  in  figure  1  can  be  thought  of  as  express¬ 
ing  two  succession  events:  IN  =  “Hensley  E.  West”,  OUT 
=  “John  Bradley”,  POST  =  “president”,  COMPANY  = 
“RESTOR  INDUSTRIES  Inc.”,  and  OUT  =  “Hensley 
E.  West”,  POST  =  “group  vice  president”,  COMPANY 
=  “DSC  Communications  Corp.”.  This  process  can  be 
modeled  as  a  prior  probability  distribution  over  sets  of 
templates.  In  this  example,  the  model  would  give  the 
prior  probability  of  a  message  containing  exactly  two 
succession  templates:  one  containing  slots  IN,  OUT, 
POST,  COMPANY  and  the  other  containing  slots  OUT, 
POST,  COMPANY. 

2)  After  deciding  what  facts  to  express,  the  author 
must  decompose  them  into  one  or  more  component 
events.  For  example,  the  succession  event  IN  =  “Hensley 
E.  West”,  OUT  =  “John  Bradley”,  POST  =  “president”, 
COMPANY  =  “RESTOR  INDUSTRIES  Inc.”  is  decom¬ 
posed  into  two  smaller  events:  IN  =  “Hensley  E.  West”, 
POST  =  “president”,  COMPANY  =  “RESTOR  INDUS¬ 
TRIES  Inc.”  and  IN  =  “Hensley  E.  West”,  OUT  =  “John 
Bradley”. This process can be modeled as a probability
distribution over “template splitting operations”,
conditioned on the full template being expressed. Template
splitting  operations  are  thus  the  generative  analogue  of 
the  merging  operations  used  in  most  information  extrac¬ 
tion  systems. 

3)  Next,  each  component  event  must  be  expressed  as  a 
linguistic  pattern.  For  example,  the  event  IN  =  “Hensley 
E.  West”,  POST  =  “president”,  COMPANY  =  “RESTOR 
INDUSTRIES  Inc.”  is  expressed  as  the  linguistic  pattern 
“IN  ...  was  named  POST  of  COMPANY”,  and  the  event 
IN  =  “Hensley  E.  West”,  OUT  =  “John  Bradley”  is  ex¬ 
pressed  as  the  linguistic  pattern  “IN  ...  fills  a  vacancy 
created  by  the  retirement ...  of  OUT”.  This  process  can 
be  modeled  as  a  probability  distribution  over  linguistic 
patterns,  conditioned  on  the  partial  template  being  ex¬ 
pressed.  Modeling  this  distribution  is  the  subject  of  the 
main  body  of  this  paper. 

4)  Finally,  the  entities  involved  in  events  must  be 
realized  as  word  strings  within  patterns.  For  exam¬ 
ple,  “RESTOR  INDUSTRIES  Inc.”  is  realized  as  “this 
telecommunications-product  concern”,  and  “Hensley  E. 
West” is realized as “Mr. West”. This process can
be modeled as a probability distribution over “descriptor
generating operations”, conditioned on the entity
being expressed and other features of the text. For
example, given that the author intends to express “Hensley
E. West”, and given that the full name appears earlier
in the text, the model would assign a certain probability
to generating the word string “Mr. West”. In this case,
the descriptor generating operation would be [title + last
name].

Clearly,  there  are  many  details  that  would  need  to  be 
resolved  before  a  complete  generative  model  of  informa¬ 
tion  extraction  could  be  implemented.  In  this  paper,  we 
have  described  a  model  containing  two  of  the  necessary 
components:  a  prior  model  over  templates,  and  a  model 
of  linguistic  patterns  conditioned  on  those  templates.  A 
complete  generative  model  for  IE  would  offer  two  po¬ 
tentially  powerful  advantages.  First,  the  model  would 
provide  principled  probability  estimates  for  selecting  the 
most  likely  set  of  templates  given  an  input  message: 

T = set of templates (the final output)
M = the message, C = components
P = linguistic patterns, S = slot fillers in T
D = descriptions used to express the slots

$$P(T \mid M) = \sum_{C,P} P(T, C, P \mid M)$$

where

$$P(T, C, P \mid M) = \frac{P(T) \times P(C \mid T) \times P(P \mid C) \times P(D \mid S)}{P(M)}$$
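
Because P(M) is the same for every candidate set of templates, the most likely T can be chosen by comparing the numerators alone, summed over the component and pattern choices. The Python sketch below illustrates this under loud assumptions: the function name, the candidate names and every probability value are invented for the example, and no such component models are implemented in this paper beyond the pattern model.

```python
import math


def unnormalized_scores(candidates):
    """Sum P(T) * P(C|T) * P(P|C) * P(D|S) over the (C, P) choices for each T.

    `candidates` maps a template-set name to a list of factor tuples,
    one tuple per (C, P) choice. P(M) is omitted since it is constant in T.
    """
    return {name: sum(math.prod(factors) for factors in choices)
            for name, choices in candidates.items()}


# Invented probabilities, for illustration only.
candidates = {
    "two_events": [(0.2, 0.5, 0.4, 0.5), (0.2, 0.5, 0.1, 0.5)],
    "one_event": [(0.5, 0.2, 0.1, 0.5)],
}
scores = unnormalized_scores(candidates)
best = max(scores, key=scores.get)
print(best, scores)
```

Dividing each score by their total would recover normalized posteriors over the listed candidates, but the ranking is already determined by the unnormalized sums.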


The  second  potential  advantage  derives  from  the  gen¬ 
erative  aspect  of  the  proposed  model.  While  there  is  an 
analogue  in  conventional  IE  systems  for  each  of  stages  2 
through 4 described above, there is no conventional
analogue to stage 1: the prior model. We can think of this
prior  model  as  encoding  domain-specific  world  knowl¬ 
edge  about  the  plausibility  of  proposed  sets  of  relations. 


8  Conclusions 

We have shown that a simple statistical model can
identify semantic slot-fillers in a management succession task
with  83%  accuracy  (F-measure  with  a  score  of  0.5  for 
partial  matches).  The  system  was  trained  on  only  560 
sentences,  with  the  additional  requirements  of  only  a 
part-of-speech  tagger  and  a  morphological  analyser.  We 
initially  considered  a  finite-state  approach  similar  to  that 
used  for  POS  tagging  (Church  88),  or  named-entity  iden¬ 
tification  (Bikel  et  al.  97),  but  argued  that  the  Markov 
approximation  gives  a  poor  model  for  this  task.  The  al¬ 
ternative,  which  has  a  PCFG  component  to  define  the 
probability  of  the  underlying  sequence  of  labels,  allows 
a  good  parameterization  of  the  problem,  and  can  be  de¬ 
coded  efficiently  using  the  CKY  algorithm.  Finally,  we 
believe  that  the  framework  presented  in  this  paper  can  be 
extended  to  model  the  complete  information  extraction 
process. 